Estimating cutoff values for diagnostic tests to achieve target specificity using extreme value theory

Background Rapidly developing tests for emerging diseases is critical for early disease monitoring. In the early stages of an epidemic, when low prevalences are expected, high specificity tests are desired to avoid numerous false positives. Selecting a cutoff to classify positive and negative test results that has the desired operating characteristics, such as specificity, is challenging for new tests because of limited validation data with known disease status. While there is ample statistical literature on estimating quantiles of a distribution, there is limited evidence on estimating extreme quantiles from limited validation data and the resulting test characteristics in the disease testing context. Methods We propose using extreme value theory to select a cutoff with predetermined specificity by fitting a Pareto distribution to the upper tail of the negative controls. We compared this method to five previously proposed cutoff selection methods in a data analysis and simulation study. We analyzed COVID-19 enzyme-linked immunosorbent assay antibody test results from long-term care facility staff and skilled nursing staff in Colorado between May and December of 2020. Results We found the extreme value approach had minimal bias when targeting a specificity of 0.995. Using the empirical quantile of the negative controls performed well when targeting a specificity of 0.95. The higher target specificity is preferred for overall test accuracy when prevalence is low, whereas the lower target specificity is preferred when prevalence is higher and resulted in less variable prevalence estimation. Discussion While commonly used, the normal-based methods showed considerable bias compared to the empirical and extreme value theory-based methods. Conclusions When determining disease testing cutoffs from small training data samples, we recommend using the extreme value-based methods when targeting a high specificity and the empirical quantile when targeting a lower specificity.
Supplementary Information The online version contains supplementary material available at 10.1186/s12874-023-02139-5.


Introduction
When faced with an emerging infectious disease outbreak, it is imperative to rapidly develop diagnostic tests to determine individual disease status and estimate community prevalence. Both individual- and community-level information is necessary to target public health interventions and deploy medical resources. In addition to designing tests that accurately measure biological samples for evidence of disease (e.g., antibodies), a critical challenge is how to classify quantitative test results as positive or negative. Therefore, a threshold, based on controls with known disease status, must be selected to determine positive and negative test results.
Estimating cutoffs for newly developed tests presents unique challenges. First, tests can show little separation in the distributions for positive and negative controls. The threshold can be chosen to target a particular sensitivity or specificity, but not both. Second, many early tests have a limited number of controls with known disease status. For example, a study found that of 47 coronavirus disease 2019 (COVID-19) antibody tests used in developing countries, the majority had fewer than 200 negative controls and some had as few as 31 [1]. Thus, estimating the cutoff that will have the desired sensitivity or specificity must be done from limited data.
This raises two important questions. First, what sensitivity or specificity should be targeted? Second, how should a cutoff value for the target sensitivity or specificity best be estimated? For emerging diseases, we expect the prevalence to be low. Thus, to optimize the number of tests with the correct result, we should prioritize correctly identifying negative results and consequently have a high specificity [2,3]. For this reason, the Centers for Disease Control and Prevention (CDC) recommended high specificity, such as 0.995, for tests developed in the early part of the COVID-19 pandemic [4]. To achieve a target specificity, researchers commonly use the corresponding quantile of the negative control distribution as a cutoff. Hence, the objective is to estimate the 0.995 quantile of a likely skewed distribution from limited training data. Two common approaches to estimating a quantile of the negative controls are to use the empirical quantile or use the quantiles of a parametric distribution, such as normal or lognormal, fitted to the data [3, 5-7]. However, these methods have not been specifically evaluated for selecting cutoffs of rapidly developed tests for emerging diseases and the resulting test characteristics.
We provide two contributions to the literature. First, we propose using methods from the extreme value theory literature to estimate a cutoff for a desired target specificity. Our proposed approach is to fit a generalized Pareto distribution to the upper tail of the negative control data [8]. This approach has been broadly used to estimate extreme values of events such as rainfall [9], air pollution exposures [10], and stock prices [11], among other applications, but has never been applied to cutoff selection. Second, we compare commonly used methods and the proposed extreme value-based approach for estimating the cutoffs of emerging disease tests through a simulation study and data application. We compare cutoff estimation methods based on their accuracy in achieving a target specificity, classifying individual test results, and estimating community prevalence. We also compare the impact of target specificities on these outcomes. In our data analysis, we focus on enzyme-linked immunosorbent assay (ELISA) antibody test data collected during the first year of the COVID-19 pandemic. However, the methods proposed are general and can be applied to data from any test. In our simulation study, we demonstrated that the extreme value method had the least bias for estimating a cutoff for a high target specificity and that lower target specificities are easier to estimate and may perform better when the objective is estimating prevalence.

Data
We used two data sources in our analysis. The training dataset contained blood samples from staff at long-term care facilities in Colorado, USA, sampled between June and December of 2020. A total of 226 staff members underwent up to five tests each, resulting in 690 samples. Each sample was tested using three different antibody tests: a neutralization assay test and two different ELISA antibody tests. One ELISA test targeted the spike protein and the other targeted the receptor-binding domain (RBD). The neutralization assay test is considered the "gold standard" in antibody testing, so we used these results to identify positive and negative controls [12-14]. This resulted in 245 positive controls and 445 negative controls. Additional details are given elsewhere [15].
The testing dataset consisted of samples from 186 skilled nursing staff during May 2020. Researchers collected one sample from each staff member and ran multiple antibody tests, including the spike and RBD ELISA tests used in the training dataset, as described elsewhere [16].
For both datasets, we normalized the results of the ELISA tests to account for batch effects. Each sample was run twice. We calculated the positive to negative ratio (P/N) by dividing the average optical density for each sample (P) by the average of the negative controls run on the same plate (N) as the sample:
$$\text{P/N ratio} = \frac{\text{average of samples}}{\text{average of negative samples on plate}} \tag{1}$$
This has been described in more detail elsewhere [7,16].

Statistical methods to estimate cutoff values
Our objective in determining the cutoff value is to estimate the Q quantile of the negative controls for a target specificity of Q. Let x denote the vector of n negative control test results.
Normal method. The normal method finds the Q quantile of a normal distribution with a mean of $\bar{x}$ and a standard deviation of $s_x$, where $\bar{x}$ and $s_x$ denote the mean and standard deviation of x, respectively [3, 5-7, 17, 18].
Lognormal method. The lognormal method is the normal method applied to the data after a natural log transformation [5,7,17]. This equates to fitting a lognormal distribution to the raw data and using the Q quantile of that lognormal distribution.
MAD method. The MAD method is a modification of the normal method that replaces the mean with the median, $\tilde{x}$, and the standard deviation with the scaled mean absolute deviation (MAD) [5,6,18]. This approach is intended to be robust to outliers.
Log MAD method. The log MAD method is the MAD method applied to natural log-transformed data [5,6].
Empirical method. The empirical method uses the empirical quantile of x as an estimator of the cutoff, avoiding parametric assumptions [5-7, 17, 18]. The empirical method is the only nonparametric method widely used in the literature and the only one considered herein.
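As an illustration of the normal, lognormal, and empirical methods described above, the following is a minimal Python sketch using only the standard library. The function names, the example P/N values, and the linear-interpolation rule for the empirical quantile are assumptions made for this sketch and are not taken from the analysis code.

```python
import math
import statistics
from statistics import NormalDist


def normal_cutoff(x, q):
    """Q quantile of a normal distribution with the sample mean and SD of x."""
    return NormalDist(statistics.mean(x), statistics.stdev(x)).inv_cdf(q)


def lognormal_cutoff(x, q):
    """Normal method on natural-log data, back-transformed to the raw scale."""
    return math.exp(normal_cutoff([math.log(v) for v in x], q))


def empirical_cutoff(x, q):
    """Empirical Q quantile with linear interpolation between order statistics.

    The interpolation rule is an assumption; other quantile conventions exist.
    """
    xs = sorted(x)
    pos = q * (len(xs) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])


# Hypothetical P/N ratios for a small set of negative controls.
negatives = [0.8, 0.9, 1.0, 1.0, 1.1, 1.2, 1.3, 1.5, 1.9, 2.6]
for name, method in [("normal", normal_cutoff),
                     ("lognormal", lognormal_cutoff),
                     ("empirical", empirical_cutoff)]:
    print(name, round(method(negatives, 0.95), 3))
```

With right-skewed data such as these hypothetical values, the three methods can give noticeably different cutoffs for the same target specificity.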
Pareto method using the upper 10% (Pareto 0.9) and upper 5% (Pareto 0.95). The Pareto method, based on extreme value theory, fits a generalized Pareto distribution to the upper tail of x. Like the normal and lognormal methods, this method fits a parametric distribution to the training data. However, it differs from those methods as the Pareto approach fits a parametric distribution only to the upper tail of the distribution of observed data. Hence, the Pareto methods focus on the part of the distribution that we are interested in rather than fitting a distribution to the center of the data and extrapolating to the tails. This approach has been shown to better approximate tail behavior in a variety of settings.
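Let u denote some threshold, and y be the values in x that exceed u. Asymptotically, under regularity conditions, y follows the generalized Pareto distribution as u approaches the upper limit of the distribution for x [8,19]. The generalized Pareto distribution has cumulative distribution function
$$F(y) = \begin{cases} 1 - \left(1 + \dfrac{\xi (y - u)}{\sigma_u}\right)^{-1/\xi}, & \xi \neq 0, \\ 1 - \exp\left(-\dfrac{y - u}{\sigma_u}\right), & \xi = 0, \end{cases}$$
for y > u, with scale parameter $\sigma_u > 0$ and shape parameter $\xi$. We make the simplifying assumption that $\xi = 0$, which results in a shifted exponential distribution and has been shown to be preferable for small sample sizes [20]. Thus, we only estimate $\sigma_u$ from the data as u is pre-specified.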
Following prior literature, we set u to be the kth percentile of x and consider two values of k: 90 and 95 [21,22]. We then fit an exponential distribution to y − u. We use maximum likelihood to estimate the rate $\lambda = 1/\sigma_u$ in $y - u \sim \text{Exp}(\lambda)$, such that $\hat{\lambda} = 1/(\bar{y} - u)$, where $\bar{y}$ is the sample mean of y. Since y is assumed to be the upper (100 − k)% of the data, the $Q' = (Q - k/100)/(1 - k/100)$ quantile of our fitted exponential distribution corresponds to the Q quantile of the data overall. Thus, we set the cutoff as
$$\text{cutoff} = u + F^{-1}(Q', \hat{\lambda}),$$
where $F^{-1}(Q', \hat{\lambda})$ is the inverse CDF of an exponential distribution with a rate parameter of $\hat{\lambda}$, evaluated at $Q'$. When Q = 0.95 and k = 95, the cutoff estimate is equivalent to the empirical method estimate because $Q' = 0$. The threshold k should be selected to be sufficiently below Q, so $Q'$ itself is not an extreme quantile. The threshold must also be sufficiently large to focus on the upper tail of the distribution.
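The Pareto cutoff calculation can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the analysis code: the function names are hypothetical, the threshold u is taken as a linearly interpolated empirical quantile, and the conditional quantile is taken to be Q′ = (Q − k/100)/(1 − k/100), which is consistent with the statement that Q′ = 0 when Q = 0.95 and k = 95. The exponential inverse CDF at Q′ with mean excess ȳ − u is −(ȳ − u) ln(1 − Q′).

```python
import math


def empirical_quantile(x, p):
    """Empirical quantile with linear interpolation (an assumed convention)."""
    xs = sorted(x)
    pos = p * (len(xs) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])


def pareto_cutoff(x, q, k):
    """Cutoff for target specificity q from an exponential fit above the kth percentile.

    Assumes the shape parameter is zero, so exceedances over u are exponential.
    """
    p = k / 100
    if not p <= q < 1:
        raise ValueError("require k/100 <= q < 1")
    u = empirical_quantile(x, p)
    excess = [v - u for v in x if v > u]
    if not excess:
        raise ValueError("no observations exceed the threshold u")
    mean_excess = sum(excess) / len(excess)   # ybar - u, the reciprocal of the fitted rate
    q_prime = (q - p) / (1 - p)               # quantile within the fitted tail
    return u - mean_excess * math.log(1 - q_prime)


# Hypothetical negative-control values, for illustration only.
negatives = [float(v) for v in range(1, 101)]
print(pareto_cutoff(negatives, 0.995, 90))
print(pareto_cutoff(negatives, 0.95, 95) == empirical_quantile(negatives, 0.95))
```

Because the fit extrapolates beyond the largest observation, the estimated cutoff for a high target specificity can exceed every value in the training data, which an empirical quantile cannot do.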
Hybrid approaches. We also consider hybrid approaches that provide a data-driven approach to select a cutoff estimation method [5,7]. We first test for normality using the Shapiro-Wilk test with a significance level of 0.05. If the test fails to reject, we use the normal method. If the test rejects normality, we natural log transform and test for normality again. If the test fails to reject, we use the lognormal method. If the test rejects normality, we use one of three methods: empirical, Pareto 0.9, or Pareto 0.95 (henceforth referred to as hybrid empirical, hybrid Pareto 0.9, and hybrid Pareto 0.95, respectively).
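The hybrid decision rule can be sketched as a small selection function. In this sketch the normality test is passed in as a callable so that the example stays self-contained; the name choose_method, the string labels, and the stub test used in the demonstration are all hypothetical. In practice the callable could wrap a Shapiro-Wilk implementation (for example, one that returns True when the p-value is below 0.05).

```python
import math
from typing import Callable, Sequence


def choose_method(x: Sequence[float],
                  rejects_normality: Callable[[Sequence[float]], bool],
                  fallback: str = "empirical") -> str:
    """Return the label of the cutoff method selected by the hybrid rule.

    rejects_normality(data) should return True when normality is rejected
    at the chosen significance level.
    """
    if not rejects_normality(x):
        return "normal"
    log_x = [math.log(v) for v in x]
    if not rejects_normality(log_x):
        return "lognormal"
    return fallback  # e.g. "empirical", "pareto_0.9", or "pareto_0.95"


# Stub test for demonstration only: "rejects" when the range is wide.
def wide_range(data):
    return max(data) - min(data) > 10


print(choose_method([1.0, 5.0, 100.0], wide_range, fallback="pareto_0.9"))
```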
Additional details on the estimation methods are given in Web Appendix 1.

Statistical methods to estimate prevalence
To accurately estimate the proportion of the population with antibodies for the disease, the seroprevalence, we account for the sensitivity and specificity of the test via the Rogan-Gladen adjustment [23], modified to disallow any negative estimates. The prevalence estimator is
$$\hat{\pi} = \max\left\{0,\ \frac{p + Q - 1}{\widehat{\text{sens}} + Q - 1}\right\}, \tag{5}$$
where p is the proportion of tests classified as positive in the testing data and $\widehat{\text{sens}}$ denotes the estimated sensitivity of the test: the proportion of the positive controls that correctly tested positive in the training data. We use the target specificity Q as the specificity estimate. While it is possible to estimate the specificity, doing so would require splitting the limited training dataset in two, one portion to estimate the cutoff and the second to estimate specificity. Further splitting limited training data is undesirable.
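A minimal sketch of the Rogan-Gladen adjustment with the lower truncation at zero is given below. The function and argument names are hypothetical, and the example inputs are illustrative values rather than results from the data analysis.

```python
def rogan_gladen(p_pos: float, sens: float, spec: float) -> float:
    """Rogan-Gladen adjusted prevalence, truncated below at zero.

    p_pos: proportion of the testing data classified as positive
    sens:  estimated sensitivity from the positive controls
    spec:  specificity, here taken to be the target specificity Q
    """
    denom = sens + spec - 1
    if denom <= 0:
        raise ValueError("sens + spec must exceed 1 for the adjustment to apply")
    return max(0.0, (p_pos + spec - 1) / denom)


# Illustrative inputs only.
print(rogan_gladen(p_pos=0.20, sens=0.63, spec=0.995))
```

The denominator shrinks as the estimated sensitivity falls, so cutoffs with low sensitivity inflate the adjustment and make the estimate more sensitive to small changes in the observed positivity.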

Data analysis
We used the training dataset to set cutoffs and evaluate the sensitivity. We established cutoffs for both the spike and RBD ELISA tests using two different target specificities: 0.95 and 0.995. For each target specificity and test, we estimated the cutoff using each of the seven methods described above and the three hybrid methods. We used the proportion of training dataset samples with positive neutralization assay results above the cutoff to estimate the sensitivity and the proportion of samples with a negative result below the cutoff to calculate the empirical specificity.
We then used the cutoffs to classify each observation in the testing dataset as positive or negative. The resulting positivity was used to calculate the Rogan-Gladen adjusted prevalence for each cutoff.

Simulation study
We modeled our simulated data after the training dataset. For each test (spike or RBD) and control type (positive or negative), we fit mixture distributions of the form
$$g(x) = \sum_{i=1}^{K} \pi_i f_i(x), \tag{6}$$
where K is the number of components, $\pi_i$ gives the weight of each component, $f_i(x)$ is the probability density function of each component evaluated at x, and g(x) is the resulting mixture distribution evaluated at x. We considered gamma, Weibull, and lognormal distributions and either two or three components. All possible combinations of these distributions were fit using the ltmix package in R for each number of components [24]. We selected the best model for each in terms of BIC and visual inspection. The resulting mixture distributions are given in Supplementary Table S1.
We sampled from the fitted mixture distributions to generate data for the simulation study. By sampling from known mixture distributions, we were able to calculate the true quantiles for the population we sampled from, allowing us to assess bias and the root mean squared error (RMSE) of the cutoff value.
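To illustrate how sampling from a known mixture gives access to the true quantile, the sketch below draws from a two-component lognormal mixture and inverts its CDF by bisection. The component weights and parameters are hypothetical placeholders and are not the fitted distributions in Supplementary Table S1; restricting the components to lognormal is an assumption made to keep the example within the Python standard library.

```python
import math
import random
from statistics import NormalDist

# Hypothetical mixture: (weight, log-scale mean, log-scale SD) per component.
COMPONENTS = [(0.7, 0.0, 0.25), (0.3, 0.5, 0.5)]


def mixture_cdf(x, components=COMPONENTS):
    """CDF of the lognormal mixture: weighted sum of component CDFs."""
    if x <= 0:
        return 0.0
    return sum(w * NormalDist(mu, sigma).cdf(math.log(x))
               for w, mu, sigma in components)


def mixture_quantile(q, components=COMPONENTS, lo=1e-9, hi=1e6, tol=1e-10):
    """True q quantile of the mixture, found by bisection on the CDF."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mixture_cdf(mid, components) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def sample_mixture(n, rng, components=COMPONENTS):
    """Draw n values: pick a component by weight, then sample from it."""
    weights = [w for w, _, _ in components]
    draws = []
    for _ in range(n):
        _, mu, sigma = rng.choices(components, weights=weights)[0]
        draws.append(rng.lognormvariate(mu, sigma))
    return draws


rng = random.Random(2020)
training_negatives = sample_mixture(50, rng)
print(mixture_quantile(0.995), max(training_negatives))
```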
We considered eight scenarios in our simulation study. The data was either simulated from the fitted spike P/N ratios distribution (scenario A) or the fitted RBD P/N ratios distribution (scenario B). We used training sample sizes of either 50 or 200 controls of each type (positive and negative), resulting in total sample sizes of 100 or 400 and a training prevalence of 0.5. We set the target specificity at 0.95 or 0.995. For each simulated training dataset, we generated a corresponding testing dataset of size 500, with the number of positive and negative controls determined by the prevalence: either 0.05 or 0.3. We generated 10,000 training datasets and testing datasets.
For each training dataset, we estimated the cutoff using all seven methods and the three hybrid methods. Then, we estimated the sensitivity of the cutoff using the proportion of the positive controls in the training dataset that were correctly predicted as positive using that cutoff. We also used each cutoff to classify positive and negative results in the testing dataset. We calculated the Rogan-Gladen adjusted prevalence as previously described.
We evaluated the cutoffs in terms of the bias and RMSE. Let $X_{Q,s}$ be the true Q quantile of the mixture distribution from which we simulated the negative controls, i.e., the true cutoff with a specificity of Q. We calculated the bias for each setting, s, as
$$\text{Cutoff Bias}_{m,Q,s} = \frac{1}{10{,}000} \sum_{i=1}^{10{,}000} \left(C_{i,m,Q,s} - X_{Q,s}\right), \tag{7}$$
where $C_{i,m,Q,s}$ is the cutoff from the ith simulated training dataset under setting s, using method m with a target specificity of Q. The RMSE was calculated as
$$\text{Cutoff RMSE}_{m,Q,s} = \sqrt{\frac{1}{10{,}000} \sum_{i=1}^{10{,}000} \left(C_{i,m,Q,s} - X_{Q,s}\right)^2}.$$
To evaluate the impact of the cutoff method on the inference drawn from the tests (the individual test results and community prevalence estimates), we used accuracy and the bias of the prevalence estimates. We calculated the accuracy of the predictions for the simulated testing datasets as the proportion of testing dataset observations that were correctly predicted for the cutoff of interest. We averaged the accuracies across the 10,000 datasets to estimate the average accuracy for each method, setting, and target quantile combination. The prevalence for each dataset, $\hat{\pi}_{i,m,Q,s}$, was calculated as in (5). For true prevalence $\pi_s$, the bias was calculated as
$$\text{Prevalence Bias}_{m,Q,s} = \frac{1}{10{,}000} \sum_{i=1}^{10{,}000} \left(\hat{\pi}_{i,m,Q,s} - \pi_s\right).$$
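The evaluation metrics can be sketched as follows. The function names are hypothetical, and the rule that a result is classified as positive when it is strictly above the cutoff is an assumption of this sketch.

```python
import math


def bias(estimates, truth):
    """Mean signed error of the estimates relative to the true value."""
    return sum(e - truth for e in estimates) / len(estimates)


def rmse(estimates, truth):
    """Root mean squared error of the estimates relative to the true value."""
    return math.sqrt(sum((e - truth) ** 2 for e in estimates) / len(estimates))


def accuracy(values, is_positive, cutoff):
    """Proportion of test results classified correctly by the cutoff.

    A result is called positive when it is strictly above the cutoff
    (an assumed convention).
    """
    correct = sum((v > cutoff) == truth for v, truth in zip(values, is_positive))
    return correct / len(values)


# Toy inputs for illustration only.
cutoff_estimates = [1.8, 2.1, 1.9, 2.4]
print(bias(cutoff_estimates, truth=2.2), rmse(cutoff_estimates, truth=2.2))
print(accuracy([0.5, 1.5, 2.5, 3.0], [False, False, True, True], cutoff=2.0))
```

A negative cutoff bias means the estimated cutoffs fall below the true quantile on average, so the realized specificity falls short of the target.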

Data analysis
Figure 1 shows the negative control training data, positive control training data, and testing data for both the spike and RBD tests. The spike test had a smaller range of P/N ratios and less separation between the positive and negative controls. The RBD negative controls had a sparser upper tail, and the positive controls had a more symmetric distribution compared to the spike test.

Spike test
Figure 2 shows the training and testing data and the estimated cutoff for each method, target specificity, and test. Web Appendix 3 shows the results in numerical form. Overall, the different estimation methods resulted [...] Using the hybrid approaches, we rejected normality for the untransformed and natural log transformed data and used the empirical and Pareto estimators.
Because the cutoffs are in the tail of the distribution for the negative controls, there are not many negative control observations between the cutoff values from the different methods (Fig. 2). Thus, the differences in the cutoffs have minimal impact on the empirical specificities (Table 1). The cutoffs had a larger impact on the empirical sensitivity because there were many positive controls in the range of the cutoffs as shown in Fig. 2. For example, the Pareto 0.9 and the lognormal cutoffs had similar training data empirical specificities, 0.993 versus 0.978, when targeting a specificity of 0.995. However, the empirical sensitivities were substantially different: 0.27 and 0.63, respectively.
The Rogan-Gladen adjusted prevalence estimate for each cutoff method is shown in Table 1. The prevalence estimates from cutoffs targeting a specificity of 0.95 ranged from 0.29 to 0.37. Those targeting 0.995 ranged from 0.29 to 0.64. Most prevalence estimates ranged from 0.26 to 0.42 with either target specificity, but the prevalence estimates from the empirical and Pareto cutoffs targeting a specificity of 0.995 were much larger, between 0.61 and 0.64.

RBD test
The estimated cutoffs for the RBD test were also more variable when targeting a specificity of 0.995. The MAD normal cutoffs were the smallest, and the empirical and Pareto cutoffs were similar to each other. We again rejected normality both for the raw and log transformed data, and the hybrid method estimates were equivalent to the empirical and Pareto estimates.
The RBD test showed greater separation in the distributions of the negative controls and positive controls, resulting in higher and more consistent empirical sensitivities, with all sensitivities greater than 0.87 (Table 1). The reduced variability in the empirical sensitivity estimates between estimation methods resulted in less variability of the prevalence estimates, compared to the spike tests.

Simulation study
Figure 1a-d show the distribution functions we generated data from. There was more overlap between the positive and negative cases in the data for scenario A than in scenario B. This is partially a result of the right skew of the positive controls and partially because the tail of the negative controls extends further in scenario A than in scenario B.

Cutoff estimation
Table 2 shows the bias and the RMSE of the cutoff for each method when targeting a specificity of 0.995. In the majority of cases, the Pareto methods were superior in terms of bias and RMSE. The only exception was scenario B with a training sample size of 50 where the RMSE was smallest for the lognormal method because the larger bias for this method was offset by the smaller variance. The cutoff estimates with every method were negatively biased, meaning the cutoff was below the true 0.995 quantile for each method, on average. Thus, the specificity of the estimated cutoff was below the target, on average. The MAD and log MAD methods were the most biased while the Pareto methods were the least biased.
The hybrid methods all had slightly higher RMSE and bias than their corresponding Pareto or empirical methods. Normality and log normality were both rejected for the vast majority of the datasets: 99-100% of datasets with a training sample size of 200 and 59-92% with a training sample size of 50. The results are, therefore, mostly the Pareto and empirical cutoffs but with a small number of poorer performing normal or lognormal cutoffs mixed in.
Table 3 shows that, when targeting a specificity of 0.95, the magnitudes of the bias and RMSE were smaller. The empirical method had the smallest bias under scenario B. The Pareto 0.9 and normal methods had a positive bias for scenario B, compared to the negative bias when targeting a specificity of 0.995.

Prevalence estimation
Table 4 shows simulation results for the Rogan-Gladen adjusted prevalence estimates when targeting a specificity of 0.995. The Pareto cutoffs had little bias but had larger variability when targeting a specificity of 0.995. In every case, the average of the prevalence point estimates was closest to the truth using one of the Pareto methods. However, in scenario A the Pareto estimates, especially with a sample size of 50, were more variable than the normal-based methods. Table 5 shows the prevalence estimates when targeting a specificity of 0.95. The variability for the Pareto and empirical methods was lower when targeting a lower specificity, particularly at the smaller sample size. With both target specificities, the MAD and log MAD methods were positively biased, while the other methods had a smaller bias, generally positive. The hybrid method estimates were again similar to the corresponding empirical and Pareto estimates.

Table 4
The bias and RMSE in parentheses of the Rogan-Gladen adjusted prevalence estimates when targeting a specificity of 0.995. The method(s) with the smallest bias in each scenario or equivalent after rounding are bolded

Test accuracy
We consider the accuracy of the cutoff estimation methods for classifying individuals as positive or negative in the testing data. Tables 6 and 7 show the proportion of testing set observations correctly classified with a target specificity of 0.995 and 0.95, respectively. The MAD methods' cutoffs were negatively biased, leading to a lower specificity and decreased accuracy in low prevalence scenarios. The Pareto methods had the highest accuracy (or equivalent to the highest accuracy) when prevalence was 0.05.
When the prevalence was higher at 0.3 and using the lower target specificity, the Pareto method was most accurate in scenario B. All but the MAD methods performed similarly for scenario A. With the higher target specificity, the MAD cutoffs had the highest accuracy for scenario A, and the lognormal method was most accurate for scenario B.

Discussion
It is imperative to rapidly develop and deploy diagnostic tests for emerging infectious diseases that can be used to classify individuals and estimate prevalence in a community. A common challenge for tests is determining a cutoff value to separate positive and negative cases as there is often overlap in the results between the positive and negative cases. This is especially challenging with early tests for emerging diseases for which there is limited training data with validated positive and negative controls. Common approaches to estimating cutoff values are using the quantile of a parametric distribution fit to the negative control test data or using the empirical quantile of the negative control test data. Yet, there is little guidance on how to select a cutoff to separate positive and negative results, especially for small data sets. Here, we proposed using methods from extreme value theory, specifically using the generalized Pareto distribution to estimate the upper tail of the negative control training data and its quantiles, to estimate a cutoff value to achieve a target specificity. We compared the proposed approach and common alternatives in a simulation study. Our simulation demonstrated that when targeting a very high specificity, 0.995 as recommended by the CDC early in the COVID-19 pandemic [4], the Pareto methods proposed had lower bias and RMSE for estimating a cutoff value. When targeting a lower target specificity of 0.95, the empirical method consistently performed well. Methods that relied on parametric distributions (e.g., normal, lognormal, MAD normal and MAD lognormal) generally had large bias and RMSE.
Additionally, we compared the recommended target specificity of 0.995 to a target specificity of 0.95 and found the preferred target specificity varied according to the goal of the analysis as well as the prevalence of the population. In the low prevalence setting we might expect for an emerging disease, using a higher target specificity of 0.995, as compared to the more moderate 0.95, resulted in better accuracy for classifying individuals as positive or negative (Tables 6 and 7). With higher prevalence, accuracy was overall higher when targeting a specificity of 0.95 instead of 0.995. We also found the variability of the prevalence estimate was generally lower for the empirical and Pareto methods when targeting a specificity of 0.95.
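One way to see why the preferred target specificity depends on prevalence is to write the expected accuracy as a prevalence-weighted average of sensitivity and specificity. This is a standard identity rather than a result from the simulation study, and the symbols below are introduced only for this illustration, with $\pi$ denoting the true prevalence:

$$\text{accuracy} = \pi \cdot \text{sens} + (1 - \pi) \cdot \text{spec}.$$

At $\pi = 0.05$ this becomes $0.05 \cdot \text{sens} + 0.95 \cdot \text{spec}$, so a change of 0.01 in specificity moves accuracy by 0.0095, whereas the same change in sensitivity moves it by only 0.0005. At $\pi = 0.3$ the weights are 0.3 and 0.7, so the sensitivity lost by raising the cutoff carries more weight.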
The results of our data analysis of two COVID-19 antibody tests are consistent with the results of the simulation study. The Pareto and empirical methods, which showed minimal negative bias in the simulation study, also tended to have the highest cutoff estimates in the data analysis. The MAD methods showed considerable negative bias in the simulation study and had the smallest estimates in the data analysis. Additionally, like the simulation study, the prevalence estimates showed more variability when targeting a specificity of 0.995 rather than 0.95.
The performance of the cutoff estimators and the resulting accuracy at the individual level and prevalence estimators at the community level will vary depending on the shape of the distributions of positive and negative results and the separation between those two distributions. The shape of the distribution impacts how accurately the target specificity can be estimated for the methods using parametric assumptions. The separation of the distributions impacts accuracy, sensitivity, and prevalence estimates. If the distributions show considerable overlap, the accuracy is lowered, and a cutoff cannot be selected that results in both a highly sensitive and highly specific test. We only generated data from two possible distributions and two possible sample sizes, so the results of our simulation study should be interpreted within this context. Considering the data analysis, the neutralization assay test we used to classify positive and negative controls is itself imperfect. The training dataset classifications in our data analysis could be incorrect, which would impact the cutoff estimates.

Table 6
The mean and middle 95% (2.5% quantile, 97.5% quantile) of the accuracy of the test as measured by the proportion of testing dataset observations correctly predicted when targeting a specificity of 0.995. The method(s) with highest accuracy in each scenario or equivalent after rounding are bolded
Because an emerging disease has potential cross-reactivity and few true positives expected, we focus on methods for establishing cutoffs that target a high specificity [2]. However, in other applications, approaches that consider both the sensitivity and specificity, as well as the relative costs of false positive and false negative results and the prevalence, may be preferred [25-28]. There are also hypothesis testing-based approaches found in the optimal cutoff selection for patient segmentation literature focused on maximizing statistical power between the groups formed by the cutoff, while controlling the Type I error rate [29-31]. This is a distinct problem from estimating high specificity cutoffs from a sample of validated negative controls, and as such, the methods presented here are not appropriate for this problem. When only estimating prevalence, some methods forgo establishing a cutoff and instead fit a mixture model [32-36] or a latent class model [37-39] to the continuous test results. In some situations, training data may be continuously collected. Users may consider streaming algorithms for quantile estimation in these situations [40,41].
Based on our simulation and data analysis, we recommend using the Pareto methods or the empirical method to estimate the cutoff when developing tests, depending on the target specificity. The commonly used normal and MAD normal methods showed a larger bias in our simulations. The choice of target specificity of the cutoff should account for the goals of the test. Higher target specificity is preferred when prevalence is very low and the objective is to identify cases, and lower target specificity is preferred if the goal is estimating prevalence.

Fig. 1 a-d Histogram of the training dataset for each test and control type overlaid with the corresponding mixture distribution from which the data was generated in the simulation study (training data only). The testing data are in panels (e) and (f). The first column corresponds to the spike test, and the second to the receptor-binding domain (RBD) test. Training data was sampled from staff at long-term care facilities in Colorado, USA between June and December 2020. Testing data was collected from skilled nursing staff in Colorado during May 2020

Fig. 2 P/N ratios for the positive controls, negative controls, and testing data, jittered horizontally. Cutoffs as calculated by each of the seven methods are shown as horizontal lines. The first row shows the spike test cutoffs with a a target specificity of 0.995 and b a target specificity of 0.95. The second row shows the receptor-binding domain (RBD) test with c a target specificity of 0.995 and d a target specificity of 0.95. Training data was sampled from staff at long-term care facilities in Colorado, USA between June and December 2020. Testing data was collected from skilled nursing staff in Colorado during May 2020

Table 1
Rogan-Gladen adjusted prevalence estimate of the testing dataset for each cutoff method, test, and target specificity. Empirical and Pareto 0.95 cutoffs are equivalent when the target specificity is 0.95. Abbreviations: MAD, mean absolute deviation; RBD, receptor-binding domain

Table 2
The mean and Monte Carlo standard error in parentheses of the bias and RMSE of the cutoff when targeting a specificity of 0.995. The method(s) with minimal bias and RMSE in each scenario or equivalent after rounding are bolded. Abbreviations: MAD, mean absolute deviation; RMSE, root mean squared error

Table 3
The mean and Monte Carlo standard error in parentheses of the bias and RMSE of the cutoff when targeting a specificity of 0.95. The method(s) with minimal bias and RMSE in each scenario or equivalent after rounding are bolded. Empirical and Pareto 0.95 cutoffs are equivalent when the target specificity is 0.95. Abbreviations: MAD, mean absolute deviation; RMSE, root mean squared error

Table 5
The bias and RMSE in parentheses of the Rogan-Gladen adjusted prevalence estimates when targeting a specificity of 0.95. The method(s) with the smallest bias in each scenario or equivalent after rounding are bolded. Empirical and Pareto 0.95 cutoffs are equivalent when the target specificity is 0.95